Conversation

@SupernaviX (Collaborator) commented Nov 5, 2025

Fixes #204

Implements a peer-network-interface module. This module runs the ChainSync and BlockFetch protocols against a small set of explicitly-configured peers. It follows the fork defined by the first peer in the list, but will switch to other forks if that peer disconnects.
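
For readers skimming the architecture notes, the failover rule amounts to "prefer the first configured peer that is still connected." Below is a minimal sketch of that idea with illustrative types; it is not the module's actual NetworkManager.

```rust
// Sketch only: peers are kept in configured order, and the preferred
// upstream is the first peer that is currently connected.
struct Peer {
    address: String,
    connected: bool,
}

struct NetworkManager {
    peers: Vec<Peer>,         // in the order they appear in configuration
    preferred: Option<usize>, // index into `peers`
}

impl NetworkManager {
    /// Re-pick the preferred upstream: the first configured peer that is
    /// still connected. Called whenever a peer connects or disconnects.
    fn update_preferred(&mut self) {
        self.preferred = self.peers.iter().position(|p| p.connected);
    }

    fn on_disconnect(&mut self, address: &str) {
        if let Some(peer) = self.peers.iter_mut().find(|p| p.address == address) {
            peer.connected = false;
        }
        self.update_preferred();
    }
}

fn main() {
    let mut network = NetworkManager {
        peers: vec![
            Peer { address: "localhost:3001".into(), connected: true },
            Peer { address: "localhost:3002".into(), connected: true },
            Peer { address: "localhost:3003".into(), connected: true },
        ],
        preferred: Some(0),
    };
    // If the first peer drops, the next connected peer becomes preferred.
    network.on_disconnect("localhost:3001");
    assert_eq!(network.preferred, Some(1));
}
```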

The testing strategy was a combination of unit tests of the ChainState struct and manual testing against three preview nodes on my laptop, which I randomly killed and revived.

Includes a small architecture diagram: https://github.com/input-output-hk/acropolis/blob/sg/peer-network-interface/modules/peer_network_interface/NOTES.md

Manual testing

To test it, you can run the omnibus process using the "local" configuration:

cd processes/omnibus
cargo run -- --config omnibus-local.toml

That configuration tries connecting to three Cardano nodes running against the preview environment, on ports 3001, 3002, and 3003. To create such a setup, you can use this gist: https://gist.github.com/SupernaviX/16627499dae71092abeac96434e96817


@buddhisthead (Collaborator)

I'd like to run this on my computer, if that's possible. Can you give some detailed instructions in the PR for how to do that, please?

@SupernaviX (Collaborator, Author)

> I'd like to run this on my computer, if that's possible. Can you give some detailed instructions in the PR for how to do that, please?

Sorry for the delay, I missed this yesterday. I just attached setup instructions.

@buddhisthead (Collaborator)

> I'd like to run this on my computer, if that's possible. Can you give some detailed instructions in the PR for how to do that, please?

> Sorry for the delay, I missed this yesterday. I just attached setup instructions.

Thanks. And how will I know that it's working? That it prints certain log messages? You mention killing and reviving the other nodes. Can you briefly describe that sequence and what I should expect?

@SupernaviX (Collaborator, Author) commented Nov 7, 2025

> I'd like to run this on my computer, if that's possible. Can you give some detailed instructions in the PR for how to do that, please?

> Sorry for the delay, I missed this yesterday. I just attached setup instructions.

> Thanks. And how will I know that it's working? That it prints certain log messages? You mention killing and reviving the other nodes. Can you briefly describe that sequence and what I should expect?

Yep! The main method I used to test failover was starting from the origin (by setting sync-point = "origin" in omnibus-local.toml) and then using docker stop and docker start to stop and start the three Cardano nodes while it synced. The module emits a log line for every 1000 messages it produces, which happens pretty rapidly when syncing from origin. It also emits log lines when a node is disconnected (and tries reconnecting every 5 seconds). So what you see in the logs with this setup is:

  • While all nodes are running, it just keeps logging that progress is being made every few seconds ("Published block 5999").
  • When one or two nodes stop running, it keeps logging that progress is being made, and also logs connection issues every 5 seconds or so.
  • When all nodes are stopped, it logs connection issues and does not log that progress is being made.
  • When a previously-stopped node starts again, it stops logging connection issues (for that node) and resumes publishing blocks if it had stopped.
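
The "tries reconnecting every 5 seconds" behaviour is essentially a retry loop per peer. A rough sketch of that loop, assuming a tokio runtime and illustrative names (not the module's actual connection code in connection.rs):

```rust
use std::time::Duration;

// Sketch only: attempt a connection, log failures, and wait 5 seconds
// before trying again, forever.
async fn run_peer(address: String) {
    loop {
        match tokio::net::TcpStream::connect(address.as_str()).await {
            Ok(_stream) => {
                // ...speak ChainSync/BlockFetch with this peer until the
                // connection drops...
            }
            Err(err) => {
                eprintln!("error connecting peer={address}: {err}");
            }
        }
        // Whether the connection failed or dropped later, pause before retrying.
        tokio::time::sleep(Duration::from_secs(5)).await;
    }
}

#[tokio::main]
async fn main() {
    // One reconnect loop per configured peer.
    for address in ["localhost:3001", "localhost:3002", "localhost:3003"] {
        tokio::spawn(run_peer(address.to_string()));
    }
    tokio::time::sleep(Duration::from_secs(30)).await;
}
```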

There are other points it can start from too (a configuration sketch follows the list):

  • from the last block in a newly-restored snapshot
  • from a local cache which it writes as it syncs (very slow and unreliable until Change format of upstream cache #341)
  • from the tip (pretty reliable, but blocks only come every ~20 seconds so the logs don't show many signs of life)
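
For illustration only, those start points map naturally onto a small configuration enum; the actual definitions in configuration.rs may differ, and serde/serde_json are assumed as dependencies here.

```rust
use serde::Deserialize;

// Sketch of how the sync point could be modelled in configuration.
#[derive(Debug, Deserialize)]
#[serde(rename_all = "kebab-case")]
enum SyncPoint {
    /// Start from the first block of the chain (sync-point = "origin").
    Origin,
    /// Start from the current tip of the chain.
    Tip,
    /// Resume from the locally written upstream cache.
    Cache,
    /// Continue from the last block of a restored snapshot.
    Snapshot,
}

fn main() {
    // "origin" is the value used in omnibus-local.toml during manual testing.
    let point: SyncPoint = serde_json::from_str("\"origin\"").unwrap();
    println!("{point:?}");
}
```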

@buddhisthead (Collaborator) commented Nov 11, 2025

thread 'main' panicked at /Users/chris/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/caryatid_process-0.12.1/src/process.rs:125:30:
called `Result::unwrap()` on an `Err` value: Pointer cache directory 'cache' does not exist.

I ran this before I started any Cardano nodes. It probably shouldn't panic, but rather print an error message and exit gracefully.

@buddhisthead (Collaborator)

Sorry, but I'm stuck on setting up the environment for this test. I got as far as trying to start up the Cardano containers. Your gist hints at nice things but doesn't explain how to use them. I realize this is sort of common knowledge for most Cardano developers, so perhaps point to a document somewhere, or make a nice gist that explains it all and then just reference that in the future for testing.

@buddhisthead (Collaborator)

Okay, I figured out that I have to run restore.sh before I run startup.sh to create the configuration directories. But the db step fails because there is no aarch64 build of mithril. It seems to still start up, but I imagine the mithril chain fetch won't work.

@buddhisthead (Collaborator)

I was able to start the Cardano images, but when I run the omnibus process, I still get:

thread 'main' panicked at /home/parallels/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/caryatid_process-0.12.1/src/process.rs:125:30:
called `Result::unwrap()` on an `Err` value: Pointer cache directory 'cache' does not exist.

So I need some help knowing what creates the cache directory.

Copilot finished reviewing on behalf of buddhisthead November 11, 2025 04:21
Copilot AI (Contributor) left a comment


Pull Request Overview

This PR implements a new peer-network-interface module that provides a more robust alternative to the existing upstream chain fetcher. The module uses the ChainSync and BlockFetch protocols to fetch blocks from multiple configured upstream peers, following one preferred chain while supporting graceful failover to other peers during network issues.

Key changes:

  • Introduces the PeerNetworkInterface module with event-driven architecture supporting multiple upstream peers
  • Refactors UpstreamCache into common for reuse across both upstream chain fetcher implementations
  • Adds support for the preview network in genesis bootstrapper

Reviewed Changes

Copilot reviewed 21 out of 22 changed files in this pull request and generated 4 comments.

Summary per file:

  • modules/peer_network_interface/src/peer_network_interface.rs: Main module implementation handling initialization, cache management, and block publishing
  • modules/peer_network_interface/src/network.rs: NetworkManager coordinating multiple peer connections and chain state
  • modules/peer_network_interface/src/chain_state.rs: ChainState tracking block announcements, rollbacks, and publishing queue across multiple peers
  • modules/peer_network_interface/src/connection.rs: PeerConnection managing individual peer connections using ChainSync and BlockFetch protocols
  • modules/peer_network_interface/src/configuration.rs: Configuration loading and sync point options
  • modules/peer_network_interface/config.default.toml: Default configuration with mainnet backbone nodes
  • modules/peer_network_interface/Cargo.toml: Package definition and dependencies
  • modules/peer_network_interface/README.md: Module documentation and usage guide
  • modules/peer_network_interface/NOTES.md: Architecture diagram and design notes
  • common/src/upstream_cache.rs: Refactored cache implementation moved from upstream_chain_fetcher for reuse
  • common/src/lib.rs: Export upstream_cache module
  • common/src/genesis_values.rs: Added kebab-case serde attribute for configuration deserialization
  • modules/upstream_chain_fetcher/src/upstream_chain_fetcher.rs: Updated to use refactored UpstreamCache from common
  • modules/upstream_chain_fetcher/src/body_fetcher.rs: Updated imports for UpstreamCache
  • modules/genesis_bootstrapper/src/genesis_bootstrapper.rs: Added preview network genesis support
  • modules/genesis_bootstrapper/build.rs: Download preview network genesis files
  • processes/omnibus/src/main.rs: Register PeerNetworkInterface module
  • processes/omnibus/Cargo.toml: Add peer_network_interface dependency
  • processes/omnibus/omnibus-local.toml: Local configuration for testing with preview network
  • processes/omnibus/.gitignore: Ignore upstream-cache directory
  • Cargo.toml: Add peer_network_interface to workspace members
  • Cargo.lock: Lock file updates for new module


@SupernaviX (Collaborator, Author)

> I was able to start the Cardano images, but when I run the omnibus process, I still get:

> thread 'main' panicked at /home/parallels/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/caryatid_process-0.12.1/src/process.rs:125:30:
> called `Result::unwrap()` on an `Err` value: Pointer cache directory 'cache' does not exist.

> So I need some help knowing what creates the cache directory.

It looked like the stake_delta_filter module was supposed to do this, and wasn't. It now does.
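
For illustration, ensuring the directory exists up front (or reporting a readable error instead of panicking, as suggested above) could look roughly like this sketch; it is not the actual change.

```rust
use std::path::Path;

// Sketch only: create the pointer cache directory if it is missing, and turn
// a failure into a readable error rather than a panic.
fn ensure_cache_dir(path: &str) -> Result<(), String> {
    let dir = Path::new(path);
    if !dir.is_dir() {
        std::fs::create_dir_all(dir)
            .map_err(|e| format!("could not create cache directory '{path}': {e}"))?;
    }
    Ok(())
}

fn main() {
    if let Err(message) = ensure_cache_dir("cache") {
        eprintln!("{message}");
        std::process::exit(1);
    }
    println!("cache directory is ready");
}
```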

@buddhisthead (Collaborator)

I was able to get it running along with the Cardano nodes. I had the Cardano nodes running, started omnibus, then killed all 3 nodes, then started them all up again. It stopped producing block messages after the first kill (all three) but did not resume publishing blocks after starting all three again. Am I abusing it too much?

2025-11-12T02:06:51.798514Z  INFO acropolis_module_peer_network_interface::network: Published block 3799
2025-11-12T02:06:51.890232Z  INFO acropolis_module_peer_network_interface::network: Published block 3899
2025-11-12T02:06:51.965319Z ERROR acropolis_module_peer_network_interface::connection: error while sending or receiving data through the channel peer="localhost:3002"
2025-11-12T02:06:51.965371Z  WARN acropolis_module_peer_network_interface::network: disconnected from localhost:3002
2025-11-12T02:06:51.971438Z  INFO acropolis_module_peer_network_interface::network: Published block 3999
2025-11-12T02:06:52.015147Z ERROR acropolis_module_peer_network_interface::connection: error while sending or receiving data through the channel peer="localhost:3001"
2025-11-12T02:06:52.015199Z  WARN acropolis_module_peer_network_interface::network: disconnected from localhost:3001
2025-11-12T02:06:52.015207Z  INFO acropolis_module_peer_network_interface::network: setting preferred upstream to localhost:3003
2025-11-12T02:06:52.017991Z  INFO acropolis_module_peer_network_interface::network: Published block 4099
2025-11-12T02:06:52.020883Z  INFO acropolis_module_peer_network_interface::network: Published block 4199
2025-11-12T02:06:52.023894Z  INFO acropolis_module_peer_network_interface::network: Published block 4299
2025-11-12T02:06:52.026742Z  INFO acropolis_module_peer_network_interface::network: Published block 4399
2025-11-12T02:06:52.028818Z ERROR acropolis_module_peer_network_interface::connection: error while sending or receiving data through the channel peer="localhost:3003"
2025-11-12T02:06:52.028862Z  WARN acropolis_module_peer_network_interface::network: disconnected from localhost:3003
2025-11-12T02:06:52.028876Z  INFO acropolis_module_peer_network_interface::network: setting preferred upstream to localhost:3002
2025-11-12T02:06:56.972588Z ERROR acropolis_module_peer_network_interface::connection: error connecting bearer: Connection refused (os error 111) peer="localhost:3002"
2025-11-12T02:06:56.972799Z  WARN acropolis_module_peer_network_interface::network: disconnected from localhost:3002
2025-11-12T02:06:56.972853Z  INFO acropolis_module_peer_network_interface::network: setting preferred upstream to localhost:3001
2025-11-12T02:06:57.024671Z ERROR acropolis_module_peer_network_interface::connection: error connecting bearer: Connection refused (os error 111) peer="localhost:3001"
2025-11-12T02:06:57.024768Z  WARN acropolis_module_peer_network_interface::network: disconnected from localhost:3001
2025-11-12T02:06:57.024794Z  INFO acropolis_module_peer_network_interface::network: setting preferred upstream to localhost:3003
2025-11-12T02:06:57.035544Z ERROR acropolis_module_peer_network_interface::connection: error connecting bearer: Connection refused (os error 111) peer="localhost:3003"
2025-11-12T02:06:57.035647Z  WARN acropolis_module_peer_network_interface::network: disconnected from localhost:3003
2025-11-12T02:06:57.035682Z  INFO acropolis_module_peer_network_interface::network: setting preferred upstream to localhost:3002
2025-11-12T02:07:01.976445Z ERROR acropolis_module_peer_network_interface::connection: error connecting bearer: Connection refused (os error 111) peer="localhost:3002"
2025-11-12T02:07:01.976590Z  WARN acropolis_module_peer_network_interface::network: disconnected from localhost:3002
2025-11-12T02:07:01.976640Z  INFO acropolis_module_peer_network_interface::network: setting preferred upstream to localhost:3001
2025-11-12T02:07:02.027124Z ERROR acropolis_module_peer_network_interface::connection: error connecting bearer: Connection refused (os error 111) peer="localhost:3001"
2025-11-12T02:07:02.027246Z  WARN acropolis_module_peer_network_interface::network: disconnected from localhost:3001
2025-11-12T02:07:02.027270Z  INFO acropolis_module_peer_network_interface::network: setting preferred upstream to localhost:3003
2025-11-12T02:07:02.036736Z ERROR acropolis_module_peer_network_interface::connection: error connecting bearer: Connection refused (os error 111) peer="localhost:3003"
2025-11-12T02:07:02.036790Z  WARN acropolis_module_peer_network_interface::network: disconnected from localhost:3003
2025-11-12T02:07:02.036800Z  INFO acropolis_module_peer_network_interface::network: setting preferred upstream to localhost:3002
2025-11-12T02:07:06.980642Z ERROR acropolis_module_peer_network_interface::connection: handshake protocol error peer="localhost:3002"
2025-11-12T02:07:06.980696Z  WARN acropolis_module_peer_network_interface::network: disconnected from localhost:3002
2025-11-12T02:07:06.980709Z  INFO acropolis_module_peer_network_interface::network: setting preferred upstream to localhost:3001
2025-11-12T02:07:07.030348Z ERROR acropolis_module_peer_network_interface::connection: handshake protocol error peer="localhost:3001"
2025-11-12T02:07:07.030421Z  WARN acropolis_module_peer_network_interface::network: disconnected from localhost:3001
2025-11-12T02:07:07.030433Z  INFO acropolis_module_peer_network_interface::network: setting preferred upstream to localhost:3003
2025-11-12T02:07:07.040029Z ERROR acropolis_module_peer_network_interface::connection: handshake protocol error peer="localhost:3003"
2025-11-12T02:07:07.040076Z  WARN acropolis_module_peer_network_interface::network: disconnected from localhost:3003
2025-11-12T02:07:07.040088Z  INFO acropolis_module_peer_network_interface::network: setting preferred upstream to localhost:3002
2025-11-12T02:07:11.981887Z ERROR acropolis_module_peer_network_interface::connection: handshake protocol error peer="localhost:3002"
2025-11-12T02:07:11.981942Z  WARN acropolis_module_peer_network_interface::network: disconnected from localhost:3002
2025-11-12T02:07:11.981958Z  INFO acropolis_module_peer_network_interface::network: setting preferred upstream to localhost:3001
2025-11-12T02:07:12.032279Z ERROR acropolis_module_peer_network_interface::connection: handshake protocol error peer="localhost:3001"
2025-11-12T02:07:12.032328Z  WARN acropolis_module_peer_network_interface::network: disconnected from localhost:3001
2025-11-12T02:07:12.032341Z  INFO acropolis_module_peer_network_interface::network: setting preferred upstream to localhost:3003
2025-11-12T02:07:17.034395Z ERROR acropolis_module_peer_network_interface::connection: handshake protocol error peer="localhost:3001"
2025-11-12T02:07:17.034450Z  WARN acropolis_module_peer_network_interface::network: disconnected from localhost:3001

@SupernaviX (Collaborator, Author)

> I was able to get it running along with the Cardano nodes. I had the Cardano nodes running, started omnibus, then killed all 3 nodes, then started them all up again. It stopped producing block messages after the first kill (all three) but did not resume publishing blocks after starting all three again. Am I abusing it too much?


Those logs indicate that the peer network interface is trying to connect to any of the three peers repeatedly, and failing every time. "Trying" is exactly the behavior we want to see, and "failing" with a "handshake protocol error" indicates that the other servers are running but not accepting connections.

I'm guessing that what happened here is that the nodes were force-killed, rather than shut down gracefully. When that happens, the Haskell node usually has to re-verify its internal state by reprocessing the entire chain history, starting from genesis, before it can accept new connections. On the preview network, that takes... like half an hour? So if I'm right, the system will recover if left alone long enough, and meanwhile Acropolis is doing the right thing by repeatedly attempting to reconnect.
